Hands-on Exercise 05c

Author

Andrea Yeo

Modified

February 11, 2025

With the assistance of ChatGPT

5. Heatmap for Visualising and Analysing Multivariate Data

5.1 Overview

Heatmaps use color variations to visualize data patterns in a tabular format. They are useful for examining multivariate data, where columns represent variables and rows represent observations.

Key Benefits of Heatmaps:

  • Show variance across multiple variables.

  • Reveal patterns and relationships between variables.

  • Identify similar variables and potential correlations

In this hands-on, we will learn how to create both static and interactive heatmaps using R for data visualization and analysis.

5.2 Installing and Launching R Packages

We will install and load the following packages in R:

  • seriation – For data ordering and clustering
  • heatmaply – For creating interactive heatmaps
  • dendextend – For enhancing dendrograms
  • tidyverse – For data manipulation and visualization
Code
pacman::p_load(seriation, dendextend, heatmaply, tidyverse, gplots)

5.3 Importing and preparing the data set

This exercise uses the World Happiness Report 2018 dataset, extracted from Excel and saved as WHData-2018.csv, for heatmap visualization and analysis in R.

5.3.1 Importing the dataset

The read_csv() function from readr is used to import WHData-2018.csv into R, converting it into a tibble data frame for easier analysis.

Code
wh <- read_csv("data/WHData-2018.csv")

We will check the dataset using below

  • glimpse(): provides a transposed overview of a dataset, showing variables and their types in a concise format.
  • head(): displays the first few rows of a dataset (default is 6 rows) to give a quick preview of the data.
  • summary(): generates a statistical summary of each variable, including measures like mean, median, and range for numeric data.
  • duplicated():returns a logical vector indicating which elements or rows in a vector or data frame are duplicates.
  • Sum(is.na()): counts the number of missing values (NA) in each column of the data frame.
  • spec(): use spec() to quickly inspect the column
Code
glimpse(wh)
Rows: 156
Columns: 12
$ Country                        <chr> "Albania", "Bosnia and Herzegovina", "B…
$ Region                         <chr> "Central and Eastern Europe", "Central …
$ `Happiness score`              <dbl> 4.586, 5.129, 4.933, 5.321, 6.711, 5.73…
$ `Whisker-high`                 <dbl> 4.695, 5.224, 5.022, 5.398, 6.783, 5.81…
$ `Whisker-low`                  <dbl> 4.477, 5.035, 4.844, 5.244, 6.639, 5.66…
$ Dystopia                       <dbl> 1.462, 1.883, 1.219, 1.769, 2.494, 1.45…
$ `GDP per capita`               <dbl> 0.916, 0.915, 1.054, 1.115, 1.233, 1.20…
$ `Social support`               <dbl> 0.817, 1.078, 1.515, 1.161, 1.489, 1.53…
$ `Healthy life expectancy`      <dbl> 0.790, 0.758, 0.712, 0.737, 0.854, 0.73…
$ `Freedom to make life choices` <dbl> 0.419, 0.280, 0.359, 0.380, 0.543, 0.55…
$ Generosity                     <dbl> 0.149, 0.216, 0.064, 0.120, 0.064, 0.08…
$ `Perceptions of corruption`    <dbl> 0.032, 0.000, 0.009, 0.039, 0.034, 0.17…
Code
head(wh)
# A tibble: 6 × 12
  Country         Region `Happiness score` `Whisker-high` `Whisker-low` Dystopia
  <chr>           <chr>              <dbl>          <dbl>         <dbl>    <dbl>
1 Albania         Centr…              4.59           4.70          4.48     1.46
2 Bosnia and Her… Centr…              5.13           5.22          5.04     1.88
3 Bulgaria        Centr…              4.93           5.02          4.84     1.22
4 Croatia         Centr…              5.32           5.40          5.24     1.77
5 Czech Republic  Centr…              6.71           6.78          6.64     2.49
6 Estonia         Centr…              5.74           5.82          5.66     1.46
# ℹ 6 more variables: `GDP per capita` <dbl>, `Social support` <dbl>,
#   `Healthy life expectancy` <dbl>, `Freedom to make life choices` <dbl>,
#   Generosity <dbl>, `Perceptions of corruption` <dbl>
Code
summary(wh)
   Country             Region          Happiness score  Whisker-high  
 Length:156         Length:156         Min.   :2.905   Min.   :3.074  
 Class :character   Class :character   1st Qu.:4.454   1st Qu.:4.590  
 Mode  :character   Mode  :character   Median :5.378   Median :5.478  
                                       Mean   :5.376   Mean   :5.479  
                                       3rd Qu.:6.168   3rd Qu.:6.260  
                                       Max.   :7.632   Max.   :7.695  
  Whisker-low       Dystopia     GDP per capita   Social support 
 Min.   :2.735   Min.   :0.292   Min.   :0.0000   Min.   :0.000  
 1st Qu.:4.345   1st Qu.:1.654   1st Qu.:0.6162   1st Qu.:1.077  
 Median :5.285   Median :1.909   Median :0.9495   Median :1.262  
 Mean   :5.273   Mean   :1.923   Mean   :0.8874   Mean   :1.217  
 3rd Qu.:6.051   3rd Qu.:2.270   3rd Qu.:1.1978   3rd Qu.:1.463  
 Max.   :7.569   Max.   :2.961   Max.   :1.6490   Max.   :1.644  
 Healthy life expectancy Freedom to make life choices   Generosity    
 Min.   :0.0000          Min.   :0.0000               Min.   :0.0000  
 1st Qu.:0.4223          1st Qu.:0.3583               1st Qu.:0.1095  
 Median :0.6440          Median :0.4940               Median :0.1740  
 Mean   :0.5980          Mean   :0.4570               Mean   :0.1816  
 3rd Qu.:0.7772          3rd Qu.:0.5800               3rd Qu.:0.2422  
 Max.   :1.0300          Max.   :0.7240               Max.   :0.5980  
 Perceptions of corruption
 Min.   :0.0000           
 1st Qu.:0.0510           
 Median :0.0820           
 Mean   :0.1125           
 3rd Qu.:0.1390           
 Max.   :0.4570           
Code
wh[duplicated(wh),]
# A tibble: 0 × 12
# ℹ 12 variables: Country <chr>, Region <chr>, Happiness score <dbl>,
#   Whisker-high <dbl>, Whisker-low <dbl>, Dystopia <dbl>,
#   GDP per capita <dbl>, Social support <dbl>, Healthy life expectancy <dbl>,
#   Freedom to make life choices <dbl>, Generosity <dbl>,
#   Perceptions of corruption <dbl>
Code
sum(is.na(wh))  
[1] 0
Code
spec(wh)
cols(
  Country = col_character(),
  Region = col_character(),
  `Happiness score` = col_double(),
  `Whisker-high` = col_double(),
  `Whisker-low` = col_double(),
  Dystopia = col_double(),
  `GDP per capita` = col_double(),
  `Social support` = col_double(),
  `Healthy life expectancy` = col_double(),
  `Freedom to make life choices` = col_double(),
  Generosity = col_double(),
  `Perceptions of corruption` = col_double()
)

The wh tibble contains 12 attributes, as shown above:

  • Categorical attributes: Country, Region

  • Continuous attributes: Happiness score, Whisker-high, Whisker-low, Dystopia, GDP per capita, Social support, Healthy life expectancy, Freedom to make life choices, Generosity, Perceptions of corruption

5.3.2 Preparing the data

Next, the rows are renamed using country names instead of row numbers,

Code
row.names(wh) <- wh$Country

The row number has been replaced into the country name.

5.3.3 Transforming the data frame into a matrix

Since the data was loaded as a data frame, it needs to be converted into a data matrix for heatmap visualization.

The code below transforms the wh data frame into a matrix format suitable for plotting.

Code
wh1 <- dplyr::select(wh, c(3, 7:12))
wh_matrix <- data.matrix(wh)

wh_matrix is in R matrix format.

5.4 Static heatmap

There are several R packages provide functions for creating static heatmaps:

  • heatmap() (R Stats) – Basic heatmap function.
  • heatmap.2() (gplots) – Enhanced version with more features.
  • pheatmap() (pheatmap) – “Pretty Heatmap” with customizable aesthetics.
  • ComplexHeatmap (Bioconductor) – Advanced heatmaps, useful for genomic data. The package’s reference guide is available here
  • superheat – Customizable heatmaps for exploring complex datasets. The package’s reference guide is available here.

This section focuses on plotting static heatmaps using heatmap()

5.4.1 Heatmap() of R stats

This section shows how to create a heatmap using heatmap() from Base R Stats, with the provided code chunk.

Code
wh_heatmap <- heatmap(wh_matrix,
                      Rowv=NA, Colv=NA)

Note
  • By default, heatmap() creates a clustered heatmap.
  • Setting Rowv = NA and Colv = NA disables row and column dendrograms.

To plot a cluster heatmap, we just have to use the default as shown in the code below.

Code
wh_heatmap <- heatmap(wh_matrix)

Note
  • The row and column order in the heatmap differs from the original wh_matrix due to clustering.
  • heatmap() reorders data by calculating distances between rows and columns, grouping similar values together.
  • Dendrograms are displayed alongside the heatmap to visualize hierarchical clustering.

In this heatmap, red cells represent smaller values, while larger values are also red, making the visualization less informative. The Happiness Score variable has relatively high values, causing other variables with smaller values to appear similar. To improve clarity, the matrix needs to be normalized using the scale argument, which can be applied to either rows or columns based on the analysis needs.

The code below normalises the matrix column-wise.

Code
wh_heatmap <- heatmap(wh_matrix,
                      scale="column",
                      cexRow = 0.6, 
                      cexCol = 0.8,
                      margins = c(10, 5))

Note
  • The values are now scaled for better visualization.
  • margins ensures that x-axis labels are fully displayed.
  • cexRow adjusts the font size of y-axis labels.
  • cexCol adjusts the font size of x-axis labels.

5.5 Creating interactive heatamp

heatmaply is an R package for creating interactive cluster heatmaps, which can be shared as stand-alone HTML files. Developed and maintained by Tal Galili, it offers powerful visualization capabilities.

It is recommended to review the Introduction to Heatmaply for an overview of its features and functions, and the user manual will also be available for reference.

We will use heatmaply to design an interactive cluster heatmap. We will still use the wh_matrix as the input data.

5.5.1 Working with heatmaply

Code
heatmaply(mtcars)

The below code create an interactive heatmap by using heatmaply package.

Code
heatmaply(wh_matrix[, -c(1, 2, 4, 5)])
Note
  • Unlike heatmap(), heatmaply() places the horizontal dendrogram on the left side of the heatmap, while row labels appear on the right
  • If the x-axis labels are too long, they are automatically rotated 135 degrees for better readability.

5.5.2 Data transformation methods in heatmaply()

When analyzing multivariate datasets, variables often have different measurement scales, making direct comparisons difficult. To address this, data transformation is commonly applied before clustering.

heatmaply() supports three main transformation methods:

  1. Scaling (scale)

    • Suitable for variables from a normal distribution.

    • Standardizes data by subtracting the mean and dividing by the standard deviation.

    • Values reflect distance from the mean in standard deviation units.

    • Supports column-wise or row-wise scaling.

  2. Normalizing (normalize)

    • Used when variables come from different or non-normal distributions.

    • Rescales values to a 0 to 1 range by subtracting the minimum and dividing by the maximum.

    • Preserves the shape of the original distribution while making values comparable.

  3. Percentizing (percentize)

    • Similar to ranking, but converts values into percentiles.

    • Uses the empirical cumulative distribution function (ecdf) to transform values.

    • Provides an intuitive interpretation: each value represents the percentage of observations at or below it.

These transformation methods ensure that variables with different scales can be effectively compared and clustered in heatmaps

5.5.2.1 Scaling method

Code below is used to scale variable values columewise.

Code
heatmaply(wh_matrix[, -c(1, 2, 4, 5)],
          scale = "column")
5.5.2.2 Normalising method

Different from Scaling, the normalise method is performed on the input data set i.e. wh_matrix as shown in the code below.

Code
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]))
5.5.2.3 Percentising method

Similar to Normalize method, the Percentize method is also performed on the input data set i.e. wh_matrix as shown in the code chunk below.

Code
heatmaply(percentize(wh_matrix[, -c(1, 2, 4, 5)]))

5.5.3 Clustering algorithm

heatmaply() supports various hierarchical clustering algorithms, allowing customization through key parameters:

  • distfun – Defines the function for computing distance (dissimilarity) between rows and columns.

    • Default: dist (Euclidean distance).

    • Options: “pearson”, “spearman”, “kendall” (correlation-based clustering).

  • hclustfun – Specifies the function for hierarchical clustering when dendrograms are not provided.

    • Default: hclust (standard hierarchical clustering).
  • dist_method – Controls the distance metric used for clustering.

    • Default: “euclidean”.

    • Options: “maximum”, “manhattan”, “canberra”, “binary”, “minkowski”.

  • hclust_method – Determines the clustering linkage method.

    • Default: “complete”.

    • Options: “ward.D”, “ward.D2”, “single”, “complete”, “average” (UPGMA), “mcquitty” (WPGMA), “median” (WPGMC), “centroid” (UPGMC).

Clustering models can be fine-tuned manually or calibrated statistically for optimal results.

5.5.4 Manual approach

In the code chunk below, the heatmap is plotted by using hierachical clustering algorithm with “Euclidean distance” and “ward.D” method.

Code
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
          dist_method = "euclidean",
          hclust_method = "ward.D")

5.5.5 Statistical approach

To determine the best clustering method and optimal number of clusters, the dendextend package provides two key functions:

  • dend_expend() – Identifies the recommended clustering method based on data structure.
  • find_k() – Helps determine the optimal number of clusters.

First, dend_expend() is used to analyze the dataset and suggest the most suitable clustering method for hierarchical clustering

Code
wh_d <- dist(normalize(wh_matrix[, -c(1, 2, 4, 5)]), method = "euclidean")
dend_expend(wh_d)[[3]]
  dist_methods hclust_methods     optim
1      unknown         ward.D 0.6137851
2      unknown        ward.D2 0.6289186
3      unknown         single 0.4774362
4      unknown       complete 0.6434009
5      unknown        average 0.6701688
6      unknown       mcquitty 0.5020102
7      unknown         median 0.5901833
8      unknown       centroid 0.6338734

The output table shows that “average” method should be used because it gave the high optimum value.

Next, find_k() is used to determine the optimal number of cluster.

Code
wh_clust <- hclust(wh_d, method = "average")
num_k <- find_k(wh_clust)
plot(num_k)

Figure above shows that k=3 would be good.

With reference to the statistical analysis results, we can prepare the code chunk as shown below.

Code
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
          dist_method = "euclidean",
          hclust_method = "average",
          k_row = 3)

5.5.6 Seriation

One limitation of hierarchical clustering is that it does not impose a strict row order, only a constraint on possible arrangements. For example, given items A, B, and C, a tree structure like ((A+B)+C) ensures that C won’t be placed between A and B, but it does not specify whether the order should be ABC or BAC for better visualization.

To address this, heatmaply uses the seriation package to optimize row and column ordering. It applies the Optimal Leaf Ordering (OLO) algorithm, which:

  • Starts with agglomerative clustering results.
  • Rotates dendrogram branches to minimize dissimilarity between adjacent leaves.
  • Optimizes the Hamiltonian path length, a restricted form of the Traveling Salesman Problem (TSP).

Applying OLO results in a clearer heatmap by minimizing the sum of distances between adjacent elements while preserving the hierarchical structure.

Code
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
          seriate = "OLO")

The default ordering method in heatmaply is “OLO” (Optimal Leaf Ordering), which optimizes row and column arrangements but has a computational complexity of O(n⁴).

The other alternatives are:

OLO optimizes row and column arrangements but has a computational complexity of O(n⁴)

Code
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
          seriate = "OLO")

GW aims for the same goal as OLO but uses a potentially faster heuristic.

Code
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
          seriate = "GW")

The option “mean” gives the output we would get by default from heatmap functions in other packages such as gplots::heatmap.2.

Code
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
          seriate = "mean")

The option “none” gives us the dendrograms without any rotation that is based on the data matrix.

Code
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
          seriate = "none")

5.5.7 Working with color palettes

The default colour palette uses by heatmaply is viridis. heatmaply users, however, can use other colour palettes in order to improve the aestheticness and visual friendliness of the heatmap.

Code
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
          seriate = "none",
          colors = viridis)
Code
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
          seriate = "none",
          colors = Blues)
Code
library(RColorBrewer)

heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]), 
          seriate = "none", 
          colors = brewer.pal(9, "RdYlBu")) 
Code
library(RColorBrewer)

heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]), 
          seriate = "none", 
          colors = colorRampPalette(c("green", "yellow", "red"))(200))

5.5.8 The finishing touch

heatmaply() offers various features to enhance both statistical analysis and visual quality of heatmaps.

In the provided code:

  • k_row = 5 → Creates five row clusters.
  • margins = c(60, 200, 0, 0) → Adjusts top (60) and row (200) margins for better label visibility.
  • fontsize_row = 4, fontsize_col = 4 → Sets row and column label font size.
  • main → Defines the plot title.
  • xlab / ylab → Labels the x-axis and y-axis.
Code
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
          Colv=NA,
          seriate = "none",
          colors = Blues,
          k_row = 5,
          margins = c(NA,200,60,NA),
          fontsize_row = 4,
          fontsize_col = 5,
          main="World Happiness Score and Variables by Country, 2018 \nDataTransformation using Normalise Method",
          xlab = "World Happiness Indicators",
          ylab = "World Countries"
          )

5.6 References

6.9 Takeaway

Key takeaways
  • Learnt about static Heatmaps: heatmap() (Base R) and heatmap.2() (gplots)
  • Learnt about interactive Heatmaps: heatmaply() allows exploration with scaling, normalizing, and percentizing options.
  • Learnt about clustering: Uses hierarchical clustering (dist_method, hclust_method) to group similar data.
  • Learnt about seriation: Optimizes row/column ordering for better clarity (OLO, GW, mean).

5.7 Further exploration

  1. To explore heatmaply_cor function, which is a wrapper around heatmaply with arguments optimised for use with correlation matrices.
Code
heatmaply_cor(
  cor(mtcars),
  xlab = "Features",
  ylab = "Features",
  k_col = 2,
  k_row = 2
)
  • Highlight Strong and Weak Correlations
  • In this case, correlations below 0.7 (absolute value) appear as white - filtering out weak correlations
Code
# Compute correlation matrix
cor_matrix <- cor(mtcars, use = "complete.obs")

# Define threshold for highlighting strong correlations
threshold <- 0.7  

# Create a masked color matrix where weak correlations are replaced with NA
masked_colors <- ifelse(abs(cor_matrix) >= threshold, cor_matrix, NA)

masked_colors[is.na(masked_colors)] <- 0  # Replace NA with 0

heatmaply_cor(
  masked_colors,
  xlab = "Features",
  ylab = "Features",
  k_col = 2,
  k_row = 2,
  main = "Heatmap with Highlighted Strong Correlations",
  cellnote = round(cor_matrix, 2),  # Show all values
  colors = colorRampPalette(c("blue", "white", "red"))(200),
  limits = c(-1, 1),
  na.value = "grey90"
)
  • Include p-values helps identify statistically significant correlations.
  • Adds “*” for significant correlations, in this case, p value < 0.05
Code
library(heatmaply)

# Function to compute correlation matrix with p-values
cor_pval_matrix <- function(df) {
  cor_test <- function(x, y) cor.test(x, y)$p.value
  p_mat <- outer(colnames(df), colnames(df), Vectorize(function(i, j) cor_test(df[[i]], df[[j]])))
  rownames(p_mat) <- colnames(p_mat) <- colnames(df)
  return(p_mat)
}

# Compute correlation and p-values
cor_matrix <- cor(mtcars)
p_matrix <- cor_pval_matrix(mtcars)

# Plot heatmap with p-value annotations
heatmaply_cor(
  cor_matrix,
  xlab = "Features",
  ylab = "Features",
  k_col = 2,
  k_row = 2,
  main = "Correlation Heatmap with P-values",
  cellnote = ifelse(p_matrix < 0.05, "*", ""),  # Adds "*" for significant correlations
  colors = viridis::viridis(100)
)
  • p-value from the correlation test is mapped to point size.
Code
r <- cor(mtcars)
## We use this function to calculate a matrix of p-values from correlation tests
## https://stackoverflow.com/a/13112337/4747043
cor.test.p <- function(x){
    FUN <- function(x, y) cor.test(x, y)[["p.value"]]
    z <- outer(
      colnames(x), 
      colnames(x), 
      Vectorize(function(i,j) FUN(x[,i], x[,j]))
    )
    dimnames(z) <- list(colnames(x), colnames(x))
    z
}
p <- cor.test.p(mtcars)

heatmaply_cor(
  r,
  node_type = "scatter",
  point_size_mat = -log10(p), 
  point_size_name = "-log10(p-value)",
  label_names = c("x", "y", "Correlation")
)
  1. To explore text annotations

heatmaply supports the cellnote argument, which allows overlaying text values on the heatmap. By default, the text color is automatically adjusted for readability—black text on light cells and white text on dark cells.

Code
heatmaply(
  mtcars,
  cellnote = mtcars
)
  1. To explore the different ways of visualizing - Correlation Matrix Visualization
Code
library(ggplot2)
library(reshape2)

# Prepare data for heatmap
melted_data <- melt(cor(mtcars))

# Plot heatmap with improved clarity
ggplot(melted_data, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "black") +  # Add borders to each cell
  geom_text(aes(label = round(value, 2)), color = "white", size = 3) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red") + 
  theme_minimal() + 
  labs(title = "Correlation Heatmap")